Singing Voice Conversion with Disentangled Representations of Singer and Vocal Technique Using Variational Autoencoders

Yin-Jyun Luo$^{1, 3}$, Chin-Cheng Hsu$^{2}$, Kat Agres$^{3,4}$, Dorien Herremans$^{1,3}$

$^{1}$Singapore University of Technology and Design
$^{2}$University of Southern California
$^{3}$Institute of High Performance Computing, A*STAR, Singapore
$^{4}$Yong Siew Toh Conservatory of Music, National University of Singapore
$\tt yinjyun\_luo@mymail.sutd.edu.sg$

Many-to-Many Singer and Singing Technique Conversion

The following audio samples correspond to Fig. 2 in the paper.

All audio files below are synthesized from Mel-spectrograms using the Griffin-Lim algorithm.
The waveforms labeled "Original Mel-spectrogram" are therefore the upper bound on audio quality for each conversion.
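
To illustrate why Griffin-Lim itself limits audio quality, the following is a minimal NumPy sketch of the algorithm. It is not the pipeline used for these samples: it works on a linear-frequency magnitude spectrogram of a synthetic tone rather than a Mel-spectrogram (which would first have to be mapped back to linear frequency), and the FFT size, hop size, iteration count, and all function names are assumptions chosen for illustration. Even when the target magnitudes are exact, the phase is only estimated, so the resynthesized waveform generally does not match the reference perfectly.

```python
import numpy as np

# Assumed analysis settings; the values used for the demo samples are not stated here.
N_FFT, HOP = 512, 128
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window


def stft(y):
    """Complex STFT with shape (N_FFT // 2 + 1, n_frames)."""
    padded = np.pad(y, N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    return np.fft.rfft(padded[idx] * WINDOW, axis=1).T


def istft(spec):
    """Inverse STFT by windowed overlap-add with window-squared normalization."""
    frames = np.fft.irfft(spec.T, n=N_FFT, axis=1) * WINDOW
    n_frames = frames.shape[0]
    out = np.zeros(N_FFT + HOP * (n_frames - 1))
    norm = np.zeros_like(out)
    for t in range(n_frames):
        out[t * HOP : t * HOP + N_FFT] += frames[t]
        norm[t * HOP : t * HOP + N_FFT] += WINDOW ** 2
    out /= np.maximum(norm, 1e-8)
    # Drop the padding that stft() added on both sides.
    return out[N_FFT // 2 : -(N_FFT // 2)]


def griffin_lim(magnitude, n_iter=32, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from random phase, then alternate between the time domain
    # and the STFT domain, keeping the target magnitude each time.
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        rebuilt = stft(istft(magnitude * angles))
        angles = rebuilt / np.maximum(np.abs(rebuilt), 1e-8)
    return istft(magnitude * angles)


def relative_magnitude_error(y, magnitude):
    """Distance between the STFT magnitude of `y` and the target, relative to the target."""
    return np.linalg.norm(np.abs(stft(y)) - magnitude) / np.linalg.norm(magnitude)


# Synthetic stand-in for a sung note: a one-second 220 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
reference = 0.5 * np.sin(2 * np.pi * 220.0 * t)

target_magnitude = np.abs(stft(reference))
estimate = griffin_lim(target_magnitude)
error = relative_magnitude_error(estimate, target_magnitude)
print(f"relative magnitude error: {error:.3f}")
```

Running more iterations typically lowers the reported error, but the phase remains an estimate, which is why the "Original Mel-spectrogram" samples serve as the reference point when judging the converted samples.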

Fig. 2(a) Singer Conversion

First row in Fig. 2(a)

m3 belt
Original Mel-spectrogram
Reconstructed Mel-spectrogram
Convert to f1
Convert to f2
Convert to f4
Convert to m4
Convert to m6

Second row in Fig. 2(a)

f4 breathy
Original Mel-spectrogram
Reconstructed Mel-spectrogram
Convert to f1
Convert to f2
Convert to m3
Convert to m4
Convert to m6

Fig. 2(b) Singing Technique Conversion

First row in Fig. 2(b)

m8 lip trill
Original Mel-spectrogram
Reconstructed Mel-spectrogram
Convert to belt
Convert to breathy
Convert to vibrato
Convert to vocal fry

Second row in Fig. 2(b)

m9 straight
Original Mel-spectrogram
Reconstructed Mel-spectrogram
Convert to belt
Convert to breathy
Convert to lip trill
Convert to vibrato
Convert to vocal fry

Additional Samples - Singer Conversion

f2 straight
Original Mel-spectrogram
Reconstructed Mel-spectrogram
Convert to f1
Convert to f2
Convert to m3
Convert to m4
Convert to m6

Additional Samples - Singing Technique Conversion

f4 breathy
Original Mel-spectrogram
Reconstructed Mel-spectrogram
Convert to belt
Convert to lip trill
Convert to vibrato
Convert to vocal fry